


A Detailed Proof: A.1 Proof of Theorem 4.1

Neural Information Processing Systems

We can compute the fixed point of the recursion in Equation A.2 and obtain the following estimate. Then we compare these two gaps. To utilize Eq. 4 for policy optimization, we follow the analysis in Section 3.2 of Kumar et al. By choosing different regularizers, a variety of instances within the CQL family can be obtained; B.36, called CFCQL(H), is the update rule we used. In the discrete action space, we train a three-layer MLP network with an MLE loss. In the continuous action space, we use the method of explicit estimation of behavior density from Wu et al.
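A minimal sketch of fitting such a behavior policy by MLE in a discrete action space (the architecture, sizes, and toy data here are illustrative assumptions, and plain NumPy stands in for the paper's actual training setup):

```python
import numpy as np

# Sketch (assumed hyperparameters): fit a behavior policy pi_beta(a|s) over
# a discrete action space by maximum likelihood, i.e. minimise the
# cross-entropy (NLL) of logged (state, action) pairs with a small MLP.
rng = np.random.default_rng(0)

S, A, H = 4, 3, 16                         # state dim, actions, hidden width
W1 = rng.normal(0, 0.1, (S, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (H, A)); b2 = np.zeros(A)

# Toy logged dataset: action = argmax of a fixed linear score of the state.
states = rng.normal(size=(256, S))
actions = np.argmax(states @ rng.normal(size=(S, A)), axis=1)

def nll(states, actions):
    h = np.tanh(states @ W1 + b1)
    logits = h @ W2 + b2
    logp = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    return -logp[np.arange(len(actions)), actions].mean()

loss0 = nll(states, actions)
lr = 0.3
for _ in range(200):
    h = np.tanh(states @ W1 + b1)
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)
    p[np.arange(len(actions)), actions] -= 1.0     # d NLL / d logits
    p /= len(actions)
    dW2 = h.T @ p; db2 = p.sum(0)
    dh = (p @ W2.T) * (1 - h ** 2)                 # backprop through tanh
    dW1 = states.T @ dh; db1 = dh.sum(0)
    W2 -= lr * dW2; b2 -= lr * db2; W1 -= lr * dW1; b1 -= lr * db1
```

After training, `nll(states, actions)` is lower than `loss0`, i.e. the MLP assigns higher likelihood to the logged actions.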



In this section, we present detailed proofs for the theoretical derivation of Thm. 1, which aims to solve the following optimization problem: min

Neural Information Processing Systems

These assumptions are not strong and can be satisfied in most environments, including MuJoCo, Atari games, and so on. Let f be a Lebesgue integrable function, let P and Q be two probability distributions, and suppose |f| ≤ C; then |E_{P(x)}[f(x)] − E_{Q(x)}[f(x)]| ≤ C · D_TV(P, Q). (5) Proof. Suppose there are two actions a1, a2 under state s, and let Q1(s, a1) = u, Q1(s, a2) = v. In this way, we can derive the upper bound of E_{a∼π2}[Q1(s, a)] − E_{a∼π1}[Q1(s, a)] as above. Since both sides of the above equation have the same minimum (here the minima are given by Q^k = Q), we can replace the objective in Problem 2 with the upper bound in Eq. (10) and solve the relaxed optimization problem.
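The total-variation lemma in Eq. (5) can be checked numerically; a minimal sketch, assuming the unnormalised convention D_TV(P, Q) = Σ_x |p(x) − q(x)| (under which the stated constant C holds for any |f| ≤ C):

```python
import numpy as np

# Numeric sanity check of |E_P f - E_Q f| <= C * D_TV(P, Q) on random
# discrete distributions, with D_TV taken as the L1 distance between the
# probability vectors (an assumed convention, not taken from the paper).
rng = np.random.default_rng(1)
p = rng.random(10); p /= p.sum()
q = rng.random(10); q /= q.sum()
f = rng.uniform(-3.0, 3.0, 10)     # bounded test function, C = max |f|
C = np.abs(f).max()

gap = abs(f @ p - f @ q)           # |E_P f - E_Q f|
d_tv = np.abs(p - q).sum()         # L1 total variation
```

Here `gap <= C * d_tv` holds for every draw, since |Σ f(x)(p(x) − q(x))| ≤ max|f| · Σ|p(x) − q(x)|.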



Direct Policy Gradients: Direct Optimization of Policies in Discrete Action Spaces

Neural Information Processing Systems

Direct optimization (McAllester et al., 2010; Song et al., 2016) is an appealing framework that replaces integration with optimization of a random objective for approximating gradients in models with discrete random variables (Lorberbom et al., 2018). A* sampling (Maddison et al., 2014) is a framework for optimizing such random objectives over large spaces. We show how to combine these techniques to yield a reinforcement learning algorithm that approximates a policy gradient by finding trajectories that optimize a random objective. We call the resulting algorithms direct policy gradient (DirPG) algorithms. A main benefit of DirPG algorithms is that they allow the insertion of domain knowledge in the form of upper bounds on return-to-go at training time, as is used in heuristic search, while still directly computing a policy gradient. We further analyze their properties, showing there are cases where DirPG has an exponentially larger probability of sampling informative gradients compared to REINFORCE. We also show that there is a built-in variance reduction technique and that a parameter that was previously viewed as a numerical approximation can be interpreted as controlling risk sensitivity. Empirically, we evaluate the effect of key degrees of freedom and show that the algorithm performs well in illustrative domains compared to baselines.
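The core idea of replacing integration with optimization of a random objective can be illustrated with the Gumbel-max trick, which A* sampling generalises to large spaces; a minimal sketch (not the paper's trajectory-level algorithm):

```python
import numpy as np

# Gumbel-max trick: argmax_i (log p_i + G_i), with G_i i.i.d. standard
# Gumbel noise, is an exact sample from Categorical(p). Sampling thus
# becomes maximisation of a randomly perturbed objective, which is the
# building block direct optimization methods scale up to trajectories.
rng = np.random.default_rng(2)
p = np.array([0.5, 0.3, 0.2])
n = 200_000
gumbels = rng.gumbel(size=(n, len(p)))             # standard Gumbel noise
samples = np.argmax(np.log(p) + gumbels, axis=1)   # one argmax per draw
freqs = np.bincount(samples, minlength=len(p)) / n
```

The empirical frequencies `freqs` match `p` up to Monte-Carlo error, confirming that the perturbed argmax samples from the intended distribution.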


Marginal Utility for Planning in Continuous or Large Discrete Action Spaces

Neural Information Processing Systems

Sample-based planning is a powerful family of algorithms for generating intelligent behavior from a model of the environment. Generating good candidate actions is critical to the success of sample-based planners, particularly in continuous or large action spaces. Typically, candidate action generation exhausts the action space, uses domain knowledge, or more recently, involves learning a stochastic policy to provide such search guidance. In this paper we explore explicitly learning a candidate action generator by optimizing a novel objective, marginal utility. The marginal utility of an action generator measures the increase in value of an action over previously generated actions.
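As a rough illustration of the stated definition (the function names and toy value function below are assumptions for exposition, not the paper's API), the marginal utility of a candidate action over previously generated actions could be sketched as:

```python
# Marginal utility sketch: the increase in the best achievable value when a
# new candidate action is added to the set generated so far (clamped at 0,
# since an action no better than existing candidates adds no value).
def marginal_utility(value, new_action, previous_actions):
    best_so_far = max((value(a) for a in previous_actions),
                      default=float("-inf"))
    return max(0.0, value(new_action) - best_so_far)

value = lambda a: -(a - 0.7) ** 2    # toy 1-D action-value function
mu_good = marginal_utility(value, 0.6, [0.0, 1.0])   # improves on both
mu_bad = marginal_utility(value, 0.0, [0.6])         # worse than 0.6
```

Training an action generator to maximise this quantity rewards proposing actions that improve on what the search has already considered, rather than merely imitating a stochastic policy.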


D2C-HRHR: Discrete Actions with Double Distributional Critics for High-Risk-High-Return Tasks

Zhang, Jundong, Situ, Yuhui, Zhang, Fanji, Deng, Rongji, Wei, Tianqi

arXiv.org Artificial Intelligence

Tasks involving high-risk-high-return (HRHR) actions, such as obstacle crossing, often exhibit multimodal action distributions and stochastic returns. Most reinforcement learning (RL) methods assume unimodal Gaussian policies and rely on scalar-valued critics, which limits their effectiveness in HRHR settings. We formally define HRHR tasks and theoretically show that Gaussian policies cannot guarantee convergence to the optimal solution. To address this, we propose a reinforcement learning framework that (i) discretizes continuous action spaces to approximate multimodal distributions, (ii) employs entropy-regularized exploration to improve coverage of risky but rewarding actions, and (iii) introduces a dual-critic architecture for more accurate discrete value distribution estimation. The framework scales to high-dimensional action spaces, supporting complex control domains. Experiments on locomotion and manipulation benchmarks with high risks of failure demonstrate that our method outperforms baselines, underscoring the importance of explicitly modeling multimodality and risk in RL.
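Point (i), discretizing a continuous action space, is commonly done per dimension so the policy head stays linear in the action dimensionality rather than exponential; a minimal sketch under assumed bin counts and bounds (not the paper's settings):

```python
import numpy as np

# Per-dimension discretization: each of d continuous action dimensions is
# split into k bins, turning one continuous action into d categorical
# choices of size k (d*k outputs instead of k**d joint actions). A
# categorical head per dimension can then represent multimodal choices.
def make_grid(low, high, bins):
    low, high = np.asarray(low, float), np.asarray(high, float)
    # bin centres per dimension; result shape (d, bins)
    return np.linspace(low, high, bins, axis=-1)

def decode(indices, grid):
    # map one categorical index per dimension back to a continuous action
    return grid[np.arange(len(indices)), indices]

grid = make_grid([-1.0, 0.0], [1.0, 2.0], bins=5)
action = decode([0, 4], grid)      # -> array([-1., 2.])
```

Entropy regularisation over these per-dimension categoricals (point ii) then keeps probability mass on risky bins that a unimodal Gaussian would collapse away from.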